Forum:FYI: Corrupted database backups (dumps)
__TOC__

Notes
* NOTE 1: This is not an error related to Memory Alpha itself, but to Wikia.
* NOTE 2: This is an FYI entry.

Context

I sent a note to Wikia support to mention that the database backups are not complete. I just got a reply indicating that this is a known issue that is under investigation and has a ticket open. Below are the reply from Wikia's support and my original email.

Reply from Wikia

Oct 26, 10:38 PM:

Hello,

Thanks for contacting Wikia.

The issue for this is due to a bug on Wikia's side where dumps of large wikias are becoming corrupted. We have a bug ticket open with our engineering team at the moment to correct this issue.

We apologize for the trouble and thank you in advance for your patience.

Director of Technical Support
Wikia Community Support Team

Every week Wikia releases code to give you new features and bug fixes. Learn what's coming out this week by reading our Technical Updates blog!
http://community.wikia.com/wiki/Blog:Wikia_Technical_Updates

Starting email

Oct 23, 11:38 PM:

Hello,

I am doing some research and data analysis on Star Trek's Memory Alpha wiki.

I requested a database dump, which was generated on 2015-09-30. The "Current pages" version had 102212 entries. (#1 below in Output)

I then noticed that there was NO entry for "James T. Kirk" in the generated XML file (James T. Kirk). What a surprise! (#2 below in Output)

I requested another database dump, which was generated on 2015-10-20. The "Current pages" version has 95874 entries. (#3 below in Output) Still no trace of the missing entry.

BTW, the wiki's statistics page shows "Pages (All pages in the wiki, including talk pages, redirects, etc.) 154,365".

I know that there is a (constant) effort in the community to clean the database, but the new dump does not show the entry for "James T. Kirk". I wonder if there is any condition that filters entries by age, or by how long ago they were last altered. Or perhaps it has something to do with the size of the entry (Capt. Kirk sure has a long history).
So I downloaded the full version, which to my surprise has a compressed file that is much smaller. (#4 below in Output)

If so, could I ask for some verification AND for a database dump that has ALL the active current pages?

I would appreciate it if this could be checked and updated.

Thanks and Regards

Output

# 1 - number of pages (all namespaces) as of 2015-09-30
grep "" enmemoryalpha_pages_current.xml | wc -l
102212

# 2 - entries for "James T. Kirk" (none), "James T. Kirk*" (six)
# "Jean-Luc Picard" (one) and "Jean-Luc Picard*" (four)
StarTrek $ grep " James T. Kirk " enmemoryalpha_pages_current.xml
StarTrek $ grep " James T. Kirk" enmemoryalpha_pages_current.xml
James T. Kirk (android)
James T. Kirk (mirror)
James T. Kirk's San Francisco apartment
James T. Kirk (fan)
James T. Kirk (alternate reality)
James T. Kirk (disambiguation)
StarTrek $ grep " James T Kirk " enmemoryalpha_pages_current.xml
James T Kirk
StarTrek $ grep " Jean-Luc Picard " enmemoryalpha_pages_current.xml
Jean-Luc Picard
StarTrek $ grep " Jean-Luc Picard" enmemoryalpha_pages_current.xml
Jean-Luc Picard Collection
Jean-Luc Picard (impostor)
Jean-Luc Picard (imposter)
Jean-Luc Picard

# 3 - number of pages (all namespaces) as of 2015-10-20
StarTrek $ grep "" enmemoryalpha_pages_current.xml | wc -l
95874

# 4 - Sizes of files
StarTrek $ ls -la
-rw-r--r-- 1 staff 244,000,837 21 Oct 00:02 enmemoryalpha_pages_current.xml
-rw-r--r--@ 1 staff 54,881,769 21 Oct 00:04 enmemoryalpha_pages_current.xml.7z
-rw-r--r-- 1 staff 824,462,419 21 Oct 00:00 enmemoryalpha_pages_full.xml
-rw-r--r--@ 1 staff 5,745,878 21 Oct 00:02 enmemoryalpha_pages_full.xml.7z

DataScientist (talk) 23:11, October 26, 2015 (UTC)

Alternative Download Scripts

:BTW, note that you can run a process to pull down the DB yourself at any given time. I've put the scripts online at http://github.com/memoryalphaen if you're interested.
:I've also put the scripts you previously shared online there (storing them in a git repository made it easier to do some twiddling of my own along the way). -- sulfur (talk) 23:59, October 26, 2015 (UTC)

::Thanks for sharing the code. DataScientist (talk) 01:13, October 27, 2015 (UTC)

:I tried the code from GitHub, but the server replies with a 301 (Moved Permanently) response:

$ perl get_pages.pl
Getting page list...
Done with ns-0, now have 51463 page(s).
51463 page(s) to fetch, 5000 at a time, 11 part(s) expected...
*** 301 Moved Permanently

:After some more research I found the code from WikiTeam.

WikiTeam links
* https://github.com/WikiTeam/wikiteam
* https://code.google.com/p/wikiteam/wiki/FAQ
* https://code.google.com/p/wikiteam/wiki/NewTutorial
* http://archiveteam.org/index.php?title=WikiTeam

I used the dumpgenerator.py script as below.

$ /usr/bin/python dumpgenerator.py \
    --api http://memory-alpha.wikia.com/api.php \
    --path ST --xml --curonly --namespace 0
Checking API... http://memory-alpha.wikia.com/api.php
API is OK: http://memory-alpha.wikia.com/api.php
Checking index.php... http://memory-alpha.wikia.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3) #
# More info at: https://github.com/WikiTeam/wikiteam #
#########################################################################
#########################################################################
# Copyright © 2011-2014 WikiTeam #
# This program is free software: you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by #
# the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. #
# #
# This program is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
# GNU General Public License for more details. #
# #
# You should have received a copy of the GNU General Public License #
# along with this program. If not, see . #
#########################################################################
Analysing http://memory-alpha.wikia.com/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = 0
Excluding titles from namespaces = None
1 namespaces found
Retrieving titles in the namespace 0
............ 51454 titles retrieved in the namespace 0
Titles saved at... memory_alphawikiacom-20151116-titles.txt
51454 page titles loaded
Retrieving the XML for every page from "start"
"Big" Ed Cooper, 1 edit
"Enterprise" Flight Manual, 1 edit
"Enterprise Flight Manual", 1 edit
$2, 1 edit
%s, 1 edit
'Bu Tones, 1 edit
'Owon, 1 edit
'Til Death Do Us Part, 1 edit
'Til Death Do Us Part (episode), 1 edit
Downloaded 10 pages
'aucdet IX, 1 edit
(A Little Adventure...) ... Goes a Long Way! The Conclusion!, 1 edit
...Nor the Battle to the Strong, 1 edit
... Gone!, 1 edit
... Let's Kill All the Lawyers!, 1 edit
... Like a Woman Scorned!, 1 edit
... The Only Good Klingon..., 1 edit
... Where No One Has Gone Before!, 1 edit
.38 Special, 1 edit
.38 police special, 1 edit
Downloaded 20 pages

-- DataScientist (talk) 00:25, November 16, 2015 (UTC)

same problem

I have been having the same problem for a year. Pages are always missing from the XML files. --

Surprisingly, I've also seen this big size difference on other wikis: the compressed current-pages dump (which should be the smaller one) is noticeably bigger than the compressed full-history dump (which should be much bigger, but is four times smaller). As an example, you can look at the Muppet Wiki (http://muppet.wikia.com/wiki/Special:Statistics). If you decompress them, though, you see that the full page history is considerably bigger (about 7 times) than the current-pages dump.
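
For anyone who wants to repeat the checks from the Output section without grep, here is a minimal Python sketch. It assumes the dump is a standard MediaWiki XML export in which each page is a page element containing a title element (the grep patterns in the post lost their angle-bracket parts, so this is a guess at what was being matched). The function name check_dump is made up for illustration, and the file name in the usage note is the one from the ls listing above. It streams the file, counts complete pages, looks for exact titles, and reports whether the XML ended cleanly, which may help spot a truncated dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}title' down to 'title'."""
    return tag.rsplit("}", 1)[-1]


def check_dump(source, wanted_titles):
    """Stream a MediaWiki XML export (path or binary file object).

    Returns (page_count, found_titles, complete). 'complete' is False when
    the parser hits malformed or truncated XML before the document ends.
    Assumption: pages are <page> elements with a <title> child.
    """
    wanted = set(wanted_titles)
    found = set()
    pages = 0
    complete = True
    try:
        for _event, elem in ET.iterparse(source, events=("end",)):
            name = local_name(elem.tag)
            if name == "title" and elem.text in wanted:
                found.add(elem.text)
            elif name == "page":
                pages += 1
                elem.clear()  # drop the finished page's children to save memory
    except ET.ParseError:
        # Raised on broken XML, including a file that stops mid-document.
        complete = False
    return pages, found, complete


if __name__ == "__main__":
    # Tiny in-memory stand-in for a dump; a real run would pass a file path.
    sample = (
        b"<mediawiki>"
        b"<page><title>James T. Kirk (mirror)</title></page>"
        b"<page><title>Jean-Luc Picard</title></page>"
        b"</mediawiki>"
    )
    pages, found, complete = check_dump(
        io.BytesIO(sample), ["James T. Kirk", "Jean-Luc Picard"]
    )
    print(pages, sorted(found), complete)
```

On a real dump the call would look like check_dump("enmemoryalpha_pages_current.xml", ["James T. Kirk"]). A title missing from the returned set together with complete being True would suggest the page was left out of the dump rather than cut off by truncation, though this only checks well-formedness and not whether Wikia's generator skipped pages on purpose.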